Abstract: New families of Fisher information and entropy
power inequalities for sums of independent random variables 
are presented. These inequalities relate the information in the 
sum of n independent random variables to the information 
contained in sums over subsets of the random variables, for an 
arbitrary collection of subsets. As a consequence, a simple proof 
of the monotonicity of information in central limit theorems is 
obtained, both in the setting of i.i.d. summands as well as in the 
more general setting of independent summands with variance- 
standardized sums. 

Index Terms: Central limit theorem; entropy power; information inequalities.



I. Introduction 

Let $X_1, X_2, \ldots, X_n$ be independent random variables with densities and finite variances. Let $H$ denote the (differential) entropy, i.e., if $f$ is the probability density function of $X$, then $H(X) = -E[\log f(X)]$. The classical entropy power inequality of Shannon [36] and Stam [39] states

$$ e^{2H(X_1 + \cdots + X_n)} \ge \sum_{j=1}^{n} e^{2H(X_j)}. \tag{1} $$



In 2004, Artstein, Ball, Barthe and Naor [1] (hereafter denoted by ABBN [1]) proved a new entropy power inequality

$$ e^{2H(X_1 + \cdots + X_n)} \ge \frac{1}{n-1} \sum_{i=1}^{n} e^{2H\left(\sum_{j \ne i} X_j\right)}, \tag{2} $$



where each term involves the entropy of the sum of $n-1$ of the variables excluding the $i$-th, and presented its implications for the monotonicity of entropy in the central limit theorem. It is not hard to see, by repeated application of (2) for a succession of values of $n$, that (2) in fact implies the inequality (4) and hence (1). We will present below a generalized entropy power inequality for subset sums that subsumes both (2) and (1) and also implies several other interesting inequalities. We provide simple and easily interpretable proofs of all of these inequalities. In particular, this provides a simplified understanding of the monotonicity of entropy in central limit theorems. A similar independent and contemporaneous development of the monotonicity of entropy is given by Tulino and Verdú [42].
Our generalized entropy power inequality for subset sums is as follows: if $\mathcal{C}$ is an arbitrary collection of subsets of $\{1, 2, \ldots, n\}$, then

$$ e^{2H(X_1 + \cdots + X_n)} \ge \frac{1}{r} \sum_{s \in \mathcal{C}} e^{2H\left(\sum_{j \in s} X_j\right)}, \tag{3} $$

where $r$ is the maximum number of sets in $\mathcal{C}$ in which any one index appears. In particular, note that

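1) Choosing $\mathcal{C}$ to be the class $\mathcal{C}_1$ of all singletons yields $r = 1$ and hence (1).

2) Choosing $\mathcal{C}$ to be the class $\mathcal{C}_{n-1}$ of all sets of $n-1$ elements yields $r = n-1$ and hence (2).

3) Choosing $\mathcal{C}$ to be the class $\mathcal{C}_m$ of all sets of $m$ elements yields $r = \binom{n-1}{m-1}$ and hence the inequality

$$ \exp\left\{2H(X_1 + \cdots + X_n)\right\} \ge \frac{1}{\binom{n-1}{m-1}} \sum_{s \in \mathcal{C}_m} \exp\left\{2H\Big(\sum_{i \in s} X_i\Big)\right\}. \tag{4} $$

4) Choosing $\mathcal{C}$ to be the class of all sets of $k$ consecutive integers yields $r = \min\{k, n+1-k\}$ and hence the inequality

$$ \exp\left\{2H(X_1 + \cdots + X_n)\right\} \ge \frac{1}{\min\{k, n+1-k\}} \sum_{s \in \mathcal{C}} \exp\left\{2H\Big(\sum_{i \in s} X_i\Big)\right\}. $$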
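The values of $r$ quoted in the four cases above can be checked by brute force. The following is a minimal sketch; the helper names (`max_multiplicity`, `m_subsets`, `consecutive_runs`) and the choice $n = 6$ are illustrative assumptions, not part of the paper.

```python
from itertools import combinations
from math import comb

def max_multiplicity(collection, n):
    """Return r: the largest number of sets in the collection containing any one index."""
    return max(sum(1 for s in collection if i in s) for i in range(1, n + 1))

def m_subsets(n, m):
    """All subsets of {1, ..., n} with exactly m elements."""
    return [frozenset(c) for c in combinations(range(1, n + 1), m)]

def consecutive_runs(n, k):
    """All sets of k consecutive integers inside {1, ..., n}."""
    return [frozenset(range(a, a + k)) for a in range(1, n - k + 2)]

n = 6  # illustrative size
r_singletons = max_multiplicity(m_subsets(n, 1), n)
r_leave_one_out = max_multiplicity(m_subsets(n, n - 1), n)
r_three = max_multiplicity(m_subsets(n, 3), n)
r_runs = max_multiplicity(consecutive_runs(n, 4), n)
```

For the consecutive-integer class the count is largest for indices near the middle of $\{1, \ldots, n\}$, which is where the value $\min\{k, n+1-k\}$ is attained.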
In general, the inequality (3) clearly yields a whole family of entropy power inequalities, for arbitrary collections of subsets. Furthermore, equality holds in any of these inequalities if and only if the $X_i$ are normally distributed and the collection $\mathcal{C}$ is "nice" in a sense that will be made precise later.

These inequalities are relevant for the examination of monotonicity in central limit theorems. Indeed, if $X_1$ and $X_2$ are independent and identically distributed (i.i.d.), then (1) is equivalent to

$$ H\left(\frac{X_1 + X_2}{\sqrt{2}}\right) \ge H(X_1), \tag{5} $$



by using the scaling $H(aX) = H(X) + \log|a|$. This fact implies that the entropy of the standardized sums $Y_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i$ for i.i.d. $X_i$ increases along the powers-of-2 subsequence, i.e., $H(Y_{2^k})$ is non-decreasing in $k$. Characterization of the change in information quantities on doubling of sample size was used in proofs of central limit theorems by Shimizu [37], Brown [8], Barron [4], Carlen and Soffer [10], Johnson and Barron [21], and Artstein, Ball, Barthe and Naor [2]. In particular, Barron [4] showed that the sequence $\{H(Y_n)\}$ converges to the entropy of the normal; this, incidentally, is equivalent to the convergence to 0 of the relative entropy (Kullback divergence) from a normal distribution. ABBN [1] showed that $H(Y_n)$ is in fact a non-decreasing sequence for every $n$, solving a long-standing conjecture. In fact, (2) is equivalent in the i.i.d. case to the monotonicity property

$$ H\left(\frac{X_1 + \cdots + X_n}{\sqrt{n}}\right) \ge H\left(\frac{X_1 + \cdots + X_{n-1}}{\sqrt{n-1}}\right). \tag{6} $$



Note that the presence of the factor $n-1$ (rather than $n$) in the denominator of (2) is crucial for this monotonicity.

Likewise, for sums of independent random variables, the inequality (4) is equivalent to "monotonicity on average" properties for certain standardizations; for instance,

$$ H\left(\frac{X_1 + \cdots + X_n}{\sqrt{n}}\right) \ge \frac{1}{\binom{n}{m}} \sum_{s \in \mathcal{C}_m} H\left(\frac{\sum_{i \in s} X_i}{\sqrt{m}}\right). $$

A similar monotonicity also holds, as we shall show, for arbitrary collections $\mathcal{C}$ and even when the sums are standardized by their variances. Here again the factor $r$ (rather than the cardinality $|\mathcal{C}|$ of the collection) in the denominator of (3) for the unstandardized version is crucial.

Outline of our development. 

We find that the main inequality (3) (and hence all of the above inequalities) as well as corresponding inequalities for Fisher information can be proved by simple tools. Two of these tools, a convolution identity for score functions and the relationship between Fisher information and entropy (discussed in Section II), are familiar in past work on information inequalities. An additional trick is needed to obtain the denominator $r$ in (3). This is a simple variance drop inequality for statistics expressible via sums of functions of subsets of a collection of variables, particular cases of which are familiar in other statistical contexts (as we shall discuss).

We recall that for a random variable $X$ with differentiable density $f$, the score function is $\rho(x) = \frac{\partial}{\partial x} \log f(x)$, the score is the random variable $\rho(X)$, and its Fisher information is $I(X) = E[\rho^2(X)]$.

Suppose for the consideration of Fisher information that the independent random variables $X_1, \ldots, X_n$ have absolutely continuous densities. To outline the heart of the matter, the first step boils down to the geometry of projections (conditional expectations). Let $\rho_{tot}$ be the score of the total sum $\sum_{i=1}^{n} X_i$ and let $\rho_s$ be the score of the subset sum $\sum_{i \in s} X_i$. As we recall in Section II, $\rho_{tot}$ is the conditional expectation (or $L^2$ projection) of each of these subset sum scores given the total sum. Consequently, any convex combination $\sum_{s \in \mathcal{C}} w_s \rho_s$ also has projection

$$ \rho_{tot} = E\left[\sum_{s \in \mathcal{C}} w_s \rho_s \,\Big|\, \sum_{i=1}^{n} X_i\right] $$

and the Fisher information $I(X_1 + \cdots + X_n) = E[\rho_{tot}^2]$ has the bound

$$ E[\rho_{tot}^2] \le E\left[\Big(\sum_{s \in \mathcal{C}} w_s \rho_s\Big)^2\right]. \tag{7} $$



For non-overlapping subsets, the independence and zero mean properties of the scores provide a direct means to express the right side in terms of the Fisher informations of the subset sums (yielding the traditional Blachman [7] proof of Stam's inequality for Fisher information). In contrast, the case of overlapping subsets requires fresh consideration. Whereas a naive application of Cauchy-Schwarz would yield a loose bound of $|\mathcal{C}| \sum_{s \in \mathcal{C}} w_s^2 E\rho_s^2$, instead a variance drop lemma yields that the right side of (7) is not more than $r \sum_{s \in \mathcal{C}} w_s^2 E\rho_s^2$ if each $i$ is in at most $r$ subsets of $\mathcal{C}$. Consequently,

$$ I(X_1 + \cdots + X_n) \le r \sum_{s \in \mathcal{C}} w_s^2 \, I\Big(\sum_{i \in s} X_i\Big) \tag{8} $$



for any weights $w_s$ that add to 1 over all subsets $s \subset \{1, \ldots, n\}$ in $\mathcal{C}$. See Sections II and IV for details. Optimizing over $w$ yields an inequality for inverse Fisher information that extends the Fisher information inequalities of Stam and ABBN:

$$ \frac{1}{I(X_1 + \cdots + X_n)} \ge \frac{1}{r} \sum_{s \in \mathcal{C}} \frac{1}{I\left(\sum_{i \in s} X_i\right)}. \tag{9} $$
Alternatively, using a scaling property of Fisher information to re-express our core inequality (8), we see that the Fisher information of the total sum is bounded by a convex combination of Fisher informations of scaled subset sums:

$$ I(X_1 + \cdots + X_n) \le \sum_{s \in \mathcal{C}} w_s \, I\left(\frac{\sum_{i \in s} X_i}{\sqrt{w_s r}}\right). \tag{10} $$



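This integrates to give an inequality for entropy that is an extension of the "linear form of the entropy power inequality" developed by Dembo et al. [15]. Specifically we obtain

$$ H(X_1 + \cdots + X_n) \ge \sum_{s \in \mathcal{C}} w_s \, H\left(\frac{\sum_{i \in s} X_i}{\sqrt{w_s r}}\right). \tag{11} $$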



See Section V for details. Likewise using the scaling property of entropy on (11) and optimizing over $w$ yields our extension (3) of the entropy power inequality, described in Section VI.
To relate this chain of ideas to other recent work, we point out that ABBN [1] was the first to show use of a variance drop lemma for information inequality development (in the leave-one-out case of the collection $\mathcal{C}_{n-1}$). For that case what is new in our presentation is going straight to the projection inequality (7) followed by the variance drop inequality, bypassing any need for the elaborate variational characterization of Fisher information that is an essential ingredient of ABBN [1]. Moreover, whereas ABBN [1] requires that the random variables have a $C^2$ density for monotonicity of Fisher information, absolute continuity of the density (as minimally needed to define the score function) suffices in our approach. Independently and contemporaneously to our work, Tulino and Verdú [42] also found a similar simplified proof of monotonicity properties of entropy and entropy derivative with a clever interpretation (its relationship to our approach is described in Section III after we complete the presentation of our development). Furthermore, a recent preprint of Shlyakhtenko [38] proves an analogue of the information inequalities in the leave-one-out case for non-commutative or "free" probability theory. In that manuscript, he also gives a proof for the classical setting assuming finiteness of all moments, whereas our direct proof requires only finite variance. Our proof also reveals in a simple manner the cases of equality in (6), for which an alternative approach in the free probability setting is in Schultz [35].

While Section III gives a direct proof of the monotonicity of entropy in the central limit theorem for i.i.d. summands, Section VII applies the preceding inequalities to study sums of non-identically distributed random variables under appropriate scalings. In particular, we show that "entropy is monotone on average" in the setting of variance-standardized sums.

Our subset sum inequalities are tight (with equality in the Gaussian case) for balanced collections of subsets, as will be defined in Section II. In Section VIII we present refined versions of our inequalities that can even be tight for certain unbalanced collections.

Section IX concludes with some discussion on potential directions of application of our results and methods. In particular, beyond the connection with central limit theorems, we also discuss potential connections of our results with distributed statistical estimation, graph theory and multi-user information theory.

Form of the inequalities. 

If $\psi(X)$ represents either the inverse Fisher information or the entropy power of $X$, then our inequalities above take the form

$$ r \, \psi(X_1 + \cdots + X_n) \ge \sum_{s \in \mathcal{C}} \psi\Big(\sum_{i \in s} X_i\Big). \tag{12} $$

We motivate the form (12) using the following almost trivial fact.



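Fact 1: For arbitrary numbers $\{a_i : i = 1, 2, \ldots, n\}$,

$$ \sum_{s \in \mathcal{C}} \sum_{i \in s} a_i = r \sum_{i=1}^{n} a_i \tag{13} $$

if each index $i$ appears in $\mathcal{C}$ the same number of times $r$.

Indeed,

$$ \sum_{s \in \mathcal{C}} \sum_{i \in s} a_i = \sum_{i=1}^{n} \sum_{s \ni i,\, s \in \mathcal{C}} a_i = \sum_{i=1}^{n} r a_i = r \sum_{i=1}^{n} a_i. $$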

If Fact 1 is thought of as $\mathcal{C}$-additivity of the sum function for real numbers, then (9) and (3) represent the $\mathcal{C}$-superadditivity of inverse Fisher information and entropy power functionals respectively with respect to convolution of the arguments. In the case of normal random variables, the inverse Fisher information and the entropy power equal the variance. Thus in that case (9) and (3) become Fact 1 with $a_i$ equal to the variance of $X_i$.
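As a small numerical illustration of Fact 1 in the Gaussian case, the sketch below takes the $a_i$ to be variances and compares $\sum_{s} \sum_{i \in s} a_i$ with $r \sum_i a_i$, once for a balanced collection and once for an unbalanced one. The variance values and the two collections are assumptions chosen only for illustration.

```python
from itertools import combinations
import math

def subset_sum_total(a, collection):
    """Left side of Fact 1: sum over s in the collection of sum over i in s of a_i."""
    return sum(sum(a[i] for i in s) for s in collection)

variances = [0.5, 1.0, 2.0, 3.5]   # a_i = Var(X_i), illustrative values
n = len(variances)

# Balanced: all 3-element subsets of 4 indices, each index appears r = 3 times.
balanced = list(combinations(range(n), 3))
r_bal = 3
lhs_bal = subset_sum_total(variances, balanced)
rhs_bal = r_bal * sum(variances)

# Unbalanced: consecutive pairs, where the end indices appear once and r = 2.
unbalanced = [(0, 1), (1, 2), (2, 3)]
r_unb = 2
lhs_unb = subset_sum_total(variances, unbalanced)
rhs_unb = r_unb * sum(variances)
```

In the balanced case the two sides agree, matching (13); in the unbalanced case only the inequality direction of (12) survives, with $\psi$ equal to the variance.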

II. Score Functions and Projections 

We use $\rho(x) = \frac{f'(x)}{f(x)}$ to denote the (almost everywhere defined) score function of the random variable $X$ with absolutely continuous probability density function $f$. The score $\rho(X)$ has zero mean, and its variance is just the Fisher information $I(X)$.

The first tool we need is a projection property of score 
functions of sums of independent random variables, which 
is well-known for smooth densities (cf., Blachman [7]). For 
completeness, we give the proof. As shown by Johnson and 
Barron [21], it is sufficient that the densities are absolutely 
continuous; see [21, Appendix 1] for an explanation of why
this is so. 



Lemma 1 (CONVOLUTION IDENTITY FOR SCORES): If $V_1$ and $V_2$ are independent random variables, and $V_1$ has an absolutely continuous density with score $\rho_1$, then $V_1 + V_2$ has the score function

$$ \rho(v) = E[\rho_1(V_1) \mid V_1 + V_2 = v]. \tag{14} $$



Proof: Let $f_1$ and $f$ be the densities of $V_1$ and $V = V_1 + V_2$ respectively. Then, either bringing the derivative inside the integral for the smooth case, or via the more general formalism in [21],

$$ f'(v) = \frac{\partial}{\partial v} E[f_1(v - V_2)] = E[f_1'(v - V_2)] = E[f_1(v - V_2)\,\rho_1(v - V_2)], $$

so that

$$ \rho(v) = \frac{f'(v)}{f(v)} = E\left[\frac{f_1(v - V_2)}{f(v)}\,\rho_1(v - V_2)\right] = E[\rho_1(V_1) \mid V_1 + V_2 = v]. $$
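As a sanity check on Lemma 1, consider the Gaussian case, where everything is available in closed form. This worked example is an added illustration (the symbols $\sigma_1^2$ and $\sigma_2^2$ are assumptions introduced here); it uses only the standard formula for the conditional mean of jointly normal variables.

```latex
% Assume V_1 ~ N(0, \sigma_1^2) and V_2 ~ N(0, \sigma_2^2), independent.
\begin{align*}
\rho_1(v_1) &= -\frac{v_1}{\sigma_1^2}, \\
E[V_1 \mid V_1 + V_2 = v] &= \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}\, v, \\
E[\rho_1(V_1) \mid V_1 + V_2 = v]
  &= -\frac{1}{\sigma_1^2} \cdot \frac{\sigma_1^2}{\sigma_1^2 + \sigma_2^2}\, v
   = -\frac{v}{\sigma_1^2 + \sigma_2^2}.
\end{align*}
% The last expression is the score of N(0, \sigma_1^2 + \sigma_2^2),
% which is the law of V_1 + V_2, in agreement with (14).
```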



The second tool we need is a "variance drop lemma", the history of which we discuss in remarks after the proof below. The following conventions are useful:

• $[n]$ is the index set $\{1, 2, \ldots, n\}$.

• For any $s \subset [n]$, $X_s$ stands for the collection of random variables $(X_i : i \in s)$, with the indices taken in their natural (increasing) order.

• For $\psi_s : \mathbb{R}^{|s|} \to \mathbb{R}$, we write $\psi_s(x_s)$ for a function of $x_s$ for any $s \subset [n]$, so that $\psi_s(x_s) = \psi_s(x_{k_1}, \ldots, x_{k_{|s|}})$, where $k_1 < k_2 < \cdots < k_{|s|}$ are the ordered indices in $s$.

• We say that a function $U : \mathbb{R}^n \to \mathbb{R}$ is $\mathcal{C}$-additive if it can be expressed in the form $\sum_{s \in \mathcal{C}} \psi_s(x_s)$.
The following notions are not required for the inequalities we present, but help to clarify the cases of equality.

• A collection $\mathcal{C}$ of subsets of $[n]$ is said to be discriminating if for any distinct indices $i$ and $j$ in $[n]$, there is a set in $\mathcal{C}$ that contains $i$ but not $j$. Note that all the collections introduced in Section I were discriminating.

• A collection $\mathcal{C}$ of subsets of $[n]$ is said to be balanced if each index $i$ in $[n]$ appears in the same number (namely, $r$) of sets in $\mathcal{C}$.

• A function $f : \mathbb{R}^d \to \mathbb{R}$ is additive if there exist functions $f_i : \mathbb{R} \to \mathbb{R}$ such that $f(x_1, \ldots, x_d) = \sum_{i=1}^{d} f_i(x_i)$, i.e., if it is $\mathcal{C}_1$-additive.



Lemma 2 (VARIANCE DROP): Let $U(X_1, \ldots, X_n) = \sum_{s \in \mathcal{C}} \psi_s(X_s)$ be a $\mathcal{C}$-additive function with mean zero components, i.e., $E\psi_s(X_s) = 0$ for each $s \in \mathcal{C}$. Then

$$ EU^2 \le r \sum_{s \in \mathcal{C}} E\{\psi_s(X_s)\}^2, \tag{15} $$

where $r$ is the maximum number of subsets $s \in \mathcal{C}$ in which any index appears. When $\mathcal{C}$ is a discriminating collection, equality can hold only if each $\psi_s$ is an additive function.

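Proof: For every subset $t$ of $[n]$, let $\bar{E}_t$ be the ANOVA projection onto the space of functions of $X_t$ (see the Appendix for details). By performing the ANOVA decomposition on each $\psi_s$, we have

$$ EU^2 = E\Big[\sum_{s \in \mathcal{C}} \sum_{t \subset s} \bar{E}_t \psi_s(X_s)\Big]^2 = E\Big[\sum_{t} \sum_{s \supset t,\, s \in \mathcal{C}} \bar{E}_t \psi_s(X_s)\Big]^2 = \sum_{t} E\Big[\sum_{s \supset t,\, s \in \mathcal{C}} \bar{E}_t \psi_s(X_s)\Big]^2, \tag{16} $$

using the orthogonality of the ANOVA decomposition of $U$ in the last step.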

Recall the elementary fact $\big(\sum_{i=1}^{m} y_i\big)^2 \le m \sum_{i=1}^{m} y_i^2$, which follows from the Cauchy-Schwarz inequality. In order to apply this observation to the preceding expression, we estimate the number of terms in the inner sum. The outer summation over $t$ can be restricted to non-empty sets $t$, since $\bar{E}_{\emptyset}$ has no effect in the summation due to $\psi_s$ having zero mean. Thus, any given $t$ in the expression has at least one element, and the sets $s \supset t$ in the collection $\mathcal{C}$ must contain it; so the number of sets $s$ over which the inner sum is taken cannot exceed $r$. Thus we have

$$ EU^2 \le \sum_{t} r \sum_{s \supset t,\, s \in \mathcal{C}} E\big(\bar{E}_t \psi_s(X_s)\big)^2 = r \sum_{s \in \mathcal{C}} \sum_{t \subset s} E\big(\bar{E}_t \psi_s(X_s)\big)^2 = r \sum_{s \in \mathcal{C}} E\big(\psi_s(X_s)\big)^2, \tag{17} $$

by rearranging the sums and using the orthogonality of the ANOVA decomposition again. This proves the inequality.

Now suppose $\psi_{s'}$ is not additive. This means that for some set $t \subset s'$ with two elements, $\bar{E}_t \psi_{s'}(X_{s'}) \ne 0$. Fix this choice of $t$. Since $\mathcal{C}$ is a discriminating collection, not all of the at most $r$ subsets containing one element of $t$ can contain the other. Consequently, for this $t$ the inner sum in (16) runs over strictly fewer than $r$ subsets $s$, and the inequality (17) must be strict. Thus each $\psi_s$ must be an additive function if equality holds, i.e., it must be composed only of main effects and no interactions. ■
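The inequality (15) and its equality condition can be checked exactly on a toy example. The sketch below is an illustration under stated assumptions: the $X_i$ are taken to be independent fair signs in $\{-1, +1\}$ (so expectations are finite averages), the collection is the leave-one-out class for $n = 3$ (so $r = 2$), and the two choices of $\psi_s$ are hypothetical. All function names are ours.

```python
from itertools import product

def second_moment(f, n):
    """Exact E[f(X)^2] for X uniform on {-1, +1}^n, i.e. independent fair signs."""
    points = list(product((-1, 1), repeat=n))
    return sum(f(x) ** 2 for x in points) / len(points)

n = 3
collection = [(0, 1), (0, 2), (1, 2)]   # leave-one-out sets for n = 3
r = 2                                   # each index lies in exactly two of them

def both_sides(psi):
    """Return (E U^2, r * sum_s E psi_s^2) when every psi_s equals psi."""
    def U(x):
        return sum(psi(tuple(x[i] for i in s)) for s in collection)
    lhs = second_moment(U, n)
    rhs = r * sum(
        second_moment(lambda x, s=s: psi(tuple(x[i] for i in s)), n)
        for s in collection
    )
    return lhs, rhs

additive = both_sides(lambda xs: xs[0] + xs[1])      # main effects only, mean zero
interaction = both_sides(lambda xs: xs[0] * xs[1])   # pure interaction, mean zero
```

With additive components the two sides coincide, while a pure two-way interaction leaves a strict gap, in line with the equality condition of Lemma 2 for discriminating collections.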



Remark 1: The idea of the variance drop inequality goes back at least to Hoeffding's seminal work [18] on $U$-statistics. Suppose $\psi : \mathbb{R}^m \to \mathbb{R}$ is symmetric in its arguments, and $E\psi(X_1, \ldots, X_m) = 0$. Define

$$ U(X_1, \ldots, X_n) = \frac{1}{\binom{n}{m}} \sum_{\{s \subset [n] : |s| = m\}} \psi(X_s). \tag{18} $$

Then Hoeffding [18] showed

$$ EU^2 \le \frac{m}{n} E\psi^2, \tag{19} $$

which is implied by Lemma 2 under the symmetry assumptions. In statistical language, $U$ defined in (18) is a $U$-statistic of degree $m$ with symmetric, mean zero kernel $\psi$ that is applied to data of sample size $n$. Thus (19) quantitatively captures the reduction of variance of a $U$-statistic when sample size $n$ increases. For $m = 1$, this is the trivial fact that the empirical variance of a function based on i.i.d. samples is the actual variance scaled by $n^{-1}$. For $m > 1$, the functions $\psi(X_s)$ are no longer independent, nevertheless the variance of the $U$-statistic drops by a factor of $\frac{m}{n}$. Our proof is valid for the more general non-symmetric case, and also seems to illuminate the underlying statistical idea (the ANOVA decomposition) as well as the underlying geometry (Hilbert space projections) better than Hoeffding's original combinatorial proof. In [16], Efron and Stein assert in their Comment 3 that an ANOVA-like decomposition "yields one-line proofs of Hoeffding's important theorems 5.1 and 5.2"; presumably our proof of Lemma 2 is a generalization of what they had in mind. As mentioned before, the application of such a variance drop lemma to information inequalities was pioneered by ABBN [1]. They proved and used it in the case $\mathcal{C} = \mathcal{C}_{n-1}$ using clear notation that we adapt in developing our generalization above. A further generalization appears when we consider refinements of our main inequalities in Section VIII.
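Hoeffding's bound (19) can likewise be verified exactly in a small case. The following sketch assumes independent fair signs as data and the product kernel $\psi(x_i, x_j) = x_i x_j$ with $n = 4$, $m = 2$; these choices are illustrative and not taken from the paper.

```python
from itertools import combinations, product
from math import comb

n, m = 4, 2
subsets = list(combinations(range(n), m))

def kernel(a, b):
    """Symmetric kernel with mean zero under independent fair signs."""
    return a * b

def u_statistic(x):
    """U as in (18): the average of the kernel over all m-subsets."""
    return sum(kernel(x[i], x[j]) for i, j in subsets) / comb(n, m)

points = list(product((-1, 1), repeat=n))
var_u = sum(u_statistic(x) ** 2 for x in points) / len(points)   # exact E[U^2]
kernel_second_moment = 1.0          # E[(X_i X_j)^2] = 1 for signs
bound = (m / n) * kernel_second_moment
```

Here the distinct kernel terms are uncorrelated, so $EU^2 = 1/\binom{4}{2} = 1/6$, comfortably below the bound $m/n = 1/2$.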

The third key tool in our approach to monotonicity is 
the well-known link between Fisher information and entropy, 
whose origin is the de Bruijn identity first described by Stam 
[39]. This identity, which identifies the Fisher information as 
the rate of change of the entropy on adding a normal, provides 
a standard way of obtaining entropy inequalities from Fisher 
information inequalities. An integral form of the de Bruijn 
identity was proved by Barron [4]. We express that integral 
in a form suitable for our purpose (cf., ABBN [1] and Verdú
and Guo [43]).



Lemma 3: Let $X$ be a random variable with a density and arbitrary finite variance. Suppose $X_t = X + \sqrt{t}\,Z$, where $Z$ is a standard normal independent of $X$. Then,

$$ H(X) = \frac{1}{2}\log(2\pi e) - \frac{1}{2}\int_0^\infty \left[I(X_t) - \frac{1}{1+t}\right] dt. \tag{20} $$



Proof: In the case that the variances of $Z$ and $X$ match, equivalent forms of this identity are given in [3] and [4]. Applying a change of variables using $t = \tau v$ to equation (2.23) of [3] (which is also equivalent to equation (2.1) of [4] by another change of variables), one has that

$$ H(X) = \frac{1}{2}\log(2\pi e v) - \frac{1}{2}\int_0^\infty \left[I(X_t) - \frac{1}{v+t}\right] dt, $$

if $X$ has variance $v$ and $Z$ has variance 1. This has the advantage of positivity of the integrand but the disadvantage that it seems to depend on $v$. One can use

$$ \log v = \int_0^\infty \left[\frac{1}{1+t} - \frac{1}{v+t}\right] dt $$

to re-express it in the form (20), which does not depend on $v$.
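For a Gaussian $X$ the identity (20) can be checked numerically, since then $I(X_t) = 1/(v+t)$ and $H(X) = \frac{1}{2}\log(2\pi e v)$. The sketch below is an added illustration; the substitution $t = u/(1-u)$ and the midpoint rule are our own choices for evaluating the integral, not part of the paper.

```python
import math

def entropy_via_lemma3(v, steps=20000):
    """Evaluate the right side of (20) for X ~ N(0, v), where I(X_t) = 1/(v + t).

    The substitution t = u/(1-u) maps [0, inf) onto [0, 1) and turns
    [1/(v+t) - 1/(1+t)] dt into (1 - v) / (v + u(1 - v)) du, which is
    smooth on [0, 1] and is integrated here with the midpoint rule.
    """
    h = 1.0 / steps
    integral = h * sum(
        (1.0 - v) / (v + (k + 0.5) * h * (1.0 - v)) for k in range(steps)
    )
    return 0.5 * math.log(2 * math.pi * math.e) - 0.5 * integral

def gaussian_entropy(v):
    """Closed-form differential entropy of N(0, v)."""
    return 0.5 * math.log(2 * math.pi * math.e * v)

errors = [abs(entropy_via_lemma3(v) - gaussian_entropy(v)) for v in (0.25, 1.0, 4.0)]
```

The integral evaluates to $-\log v$, which is exactly the $\log v$ correction used in the proof to remove the apparent dependence on $v$.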



III. Monotonicity in the IID Case

For clarity of presentation of ideas, we focus first on the i.i.d. setting. For i.i.d. summands, inequalities (2) and (4) reduce to the monotonicity property $H(Y_n) \ge H(Y_m)$ for $n > m$, where

$$ Y_n = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} X_i. \tag{21} $$



We exhibit below how our approach provides a simple proof 
of this monotonicity property, first proved by ABBN [1] using 
somewhat more elaborate means. We begin by showing the 
monotonicity of the Fisher information. 



Proposition 1 (MONOTONICITY OF FISHER INFORMATION): If $\{X_i\}$ are i.i.d. random variables, and $Y_{n-1}$ has an absolutely continuous density, then

$$ I(Y_n) \le I(Y_{n-1}), \tag{22} $$

with equality iff $X_1$ is normal or $I(Y_n) = \infty$.



Proof: We use the following notation: The (unnormalized) sum is $S_n = \sum_{i \in [n]} X_i$, and the leave-one-out sum leaving out $X_j$ is $S^{(j)} = \sum_{i \ne j} X_i$. Setting $\rho(S_n)$ to be the score of $S_n$ and $\rho_j$ to be the score of $S^{(j)}$, we have by Lemma 1 that $\rho(S_n) = E[\rho_j \mid S_n]$ for each $j$, and hence

$$ \rho(S_n) = E\Big[\frac{1}{n} \sum_{j \in [n]} \rho_j \,\Big|\, S_n\Big]. $$

Since the norm of the score is not less than that of its projection (i.e., by the Cauchy-Schwarz inequality),

$$ I(S_n) = E[\rho^2(S_n)] \le E\Big[\Big(\frac{1}{n} \sum_{j \in [n]} \rho_j\Big)^2\Big]. $$

Lemma 2 yields

$$ E\Big[\Big(\frac{1}{n} \sum_{j \in [n]} \rho_j\Big)^2\Big] \le (n-1) \sum_{j \in [n]} \frac{1}{n^2} E[\rho_j^2] = \frac{n-1}{n} I(S_{n-1}), $$

so that

$$ I(S_n) \le \frac{n-1}{n} I(S_{n-1}). $$

If $X' = aX$, then $\rho_{X'}(X') = \frac{1}{a}\rho_X(X)$ and $a^2 I(X') = I(X)$; hence

$$ I(Y_n) = n I(S_n) \le (n-1) I(S_{n-1}) = I(Y_{n-1}). $$

The inequality implied by Lemma 2 can be tight only if each $\rho_j$, considered as a function of the random variables $X_i$, $i \ne j$, is additive. However, we already know that $\rho_j$ is a function of the sum of these random variables. The only functions that are both additive and functions of the sum are linear functions of the sum; hence the two sides of (22) can be finite and equal only if each of the scores $\rho_j$ is linear, i.e., if all the $X_i$ are normal. It is trivial to check that $X_1$ normal or $I(Y_n) = \infty$ imply equality. ■

The monotonicity result for entropy in the i.i.d. case now follows by combining Proposition 1 and Lemma 3.

Theorem 1 (MONOTONICITY OF ENTROPY: IID CASE): Suppose $\{X_i\}$ are i.i.d. random variables with densities and finite variance. If the normalized sum $Y_n$ is defined by (21), then

$$ H(Y_n) \ge H(Y_{n-1}). $$

The two sides are finite and equal iff $X_1$ is normal.
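To see Theorem 1 at work on a concrete non-Gaussian example, the sketch below approximates $H(Y_n)$ for i.i.d. summands uniform on $[0, 1]$ by discretizing the density on a grid and convolving. The uniform distribution, the grid size, and the function names are assumptions made for illustration; the computed values are numerical approximations, not exact entropies.

```python
import math

def convolve(p, q):
    """Plain discrete convolution of two probability vectors."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def standardized_entropies(max_n, grid=200):
    """Approximate H(Y_n), n = 1..max_n, for X_i uniform on [0, 1]."""
    h = 1.0 / grid
    base = [h] * grid            # cell probabilities of one discretized uniform
    pmf = base
    values = []
    for n in range(1, max_n + 1):
        if n > 1:
            pmf = convolve(pmf, base)   # cell probabilities of S_n
        # Differential entropy of S_n, treating pmf / h as a piecewise-constant density.
        h_sum = -sum(p * math.log(p / h) for p in pmf if p > 0.0)
        # Scaling property: H(S_n / sqrt(n)) = H(S_n) - (1/2) log n.
        values.append(h_sum - 0.5 * math.log(n))
    return values

entropies = standardized_entropies(4)
# Entropy of the normal with the same variance 1/12 as each Y_n.
gaussian_limit = 0.5 * math.log(2 * math.pi * math.e / 12.0)
```

The resulting sequence increases with $n$ and stays below the entropy of the normal with variance $1/12$, as the theorem and the central limit theorem together suggest.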

After the submission of these results to ISIT 2006 [28], we became aware of a contemporaneous and independent development of the simple proof of the monotonicity fact (Theorem 1) by Tulino and Verdú [42]. In their work they take nice advantage of projection properties through minimum mean squared error interpretations. It is pertinent to note that the proofs of Theorem 1 (in [42] and in this paper) share essentials, because of the following observations.

Consider estimation of a random variable $X$ from an observation $Y = X + Z$ in which an independent standard normal $Z$ has been added. Then the score function of $Y$ is related to the difference between two predictors of $X$ (maximum likelihood and Bayes), i.e.,

$$ -\rho(Y) = Y - E[X \mid Y], \tag{23} $$

and hence the Fisher information $I(Y) = E\rho^2(Y)$ is the same as the mean square difference $E[(Y - E[X \mid Y])^2]$, or equivalently, by the Pythagorean identity,

$$ I(Y) = \mathrm{Var}(Z) - E[(X - E[X \mid Y])^2]. \tag{24} $$



Thus the Fisher information (entropy derivative) is related to the minimal mean squared error. These (and more general) identities relating differences between predictors to scores and relating their mean squared errors to Fisher informations are developed in statistical decision theory in the work of Stein and Brown. These developments are described, for instance, in the point estimation text by Lehmann and Casella [25, Chapters 4.3 and 5.5], in their study of Bayes risk, admissibility and minimaxity of conditional means $E[X \mid Y]$.

Tulino and Verdú [42] emphasize the minimal mean squared error property of the entropy derivative and associated projection properties that (along with the variance drop inequality which they note in the leave-one-out case) also give Proposition 1 and Theorem 1. That is a nice idea. Working directly with the minimal mean squared error as the entropy derivative they bypass the use of Fisher information. In the same manner Verdú and Guo [43] give an alternative proof of the Shannon-Stam entropy power inequality. If one takes note of the above identities one sees that their proofs and ours are substantially the same, except that the same quantities are given alternative interpretations in the two works, and that we give extensions to arbitrary collections of subsets.

IV. Fisher Information Inequalities 
In this section, we demonstrate our core inequality (8). 



Proposition 2: Let $\{X_i\}$ be independent random variables with densities and finite variances. Define

$$ T_n = \sum_{i \in [n]} X_i \quad \text{and} \quad T^{(s)} = \sum_{i \in s} X_i \tag{25} $$

for each $s \in \mathcal{C}$, where $\mathcal{C}$ is an arbitrary collection of subsets of $[n]$. Let $w$ be any probability distribution on $\mathcal{C}$. If each $T^{(s)}$ has an absolutely continuous density, then

$$ I(T_n) \le r \sum_{s \in \mathcal{C}} w_s^2 \, I(T^{(s)}), \tag{26} $$

where $w_s = w(\{s\})$. When $\mathcal{C}$ is discriminating, both sides can be finite and equal only if each $X_i$ is normal.

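Proof: Let $\rho_s$ be the score of $T^{(s)}$. We proceed in accordance with the outline in the introduction. Indeed, Lemma 1 implies that $\rho(T_n) = E[\rho_s \mid T_n]$ for each $s$. Taking a convex combination of these identities gives, for any $\{w_s\}$ such that $\sum_{s \in \mathcal{C}} w_s = 1$,

$$ \rho(T_n) = \sum_{s \in \mathcal{C}} w_s E[\rho_s \mid T_n] = E\Big[\sum_{s \in \mathcal{C}} w_s \rho_s \,\Big|\, T_n\Big]. \tag{27} $$

Applying the Cauchy-Schwarz inequality,

$$ \rho^2(T_n) \le E\Big[\Big(\sum_{s \in \mathcal{C}} w_s \rho_s\Big)^2 \,\Big|\, T_n\Big]. \tag{28} $$

Taking the expectation and then applying Lemma 2 in succession, we get

$$ E[\rho^2(T_n)] \le E\Big[\Big(\sum_{s \in \mathcal{C}} w_s \rho_s\Big)^2\Big] \tag{29} $$
$$ \le r \sum_{s \in \mathcal{C}} E(w_s \rho_s)^2 \tag{30} $$
$$ = r \sum_{s \in \mathcal{C}} w_s^2 \, I(T^{(s)}), \tag{31} $$

as desired. The application of Lemma 2 can yield equality only if each $\rho(T^{(s)})$ is additive; since the score $\rho(T^{(s)})$ is already a function of the sum $T^{(s)}$, it must in fact be a linear function, so that each $X_i$ must be normal. ■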

Naturally, it is of interest to minimize the upper bound of Proposition 2 over the weighting distribution $w$, which is easily done either by an application of Jensen's inequality for the reciprocal function, or by the method of Lagrange multipliers. Optimization of the bound implies that Proposition 2 is equivalent to the following Fisher information inequalities.



Theorem 2: Let $\{X_i\}$ be independent random variables such that each has an absolutely continuous density. Then

$$ \frac{1}{I(T_n)} \ge \frac{1}{r} \sum_{s \in \mathcal{C}} \frac{1}{I(T^{(s)})}. \tag{32} $$

When $\mathcal{C}$ is discriminating, the two sides are positive and equal iff each $X_i$ is normal and $\mathcal{C}$ is also balanced.
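The passage from (26) to (32) rests on minimizing $r \sum_s w_s^2 I(T^{(s)})$ over probability weights. A minimal sketch of that optimization follows; the Fisher information values and $r$ are hypothetical numbers, and the claim that weights proportional to $1/I(T^{(s)})$ are optimal is the standard Lagrange-multiplier solution mentioned in the text.

```python
def bound(weights, infos, r):
    """Right side of (26): r * sum_s w_s^2 * I(T^(s))."""
    return r * sum(w * w * i for w, i in zip(weights, infos))

def optimal_weights(infos):
    """Weights proportional to 1 / I(T^(s)), normalized to sum to one."""
    inverse = [1.0 / i for i in infos]
    total = sum(inverse)
    return [x / total for x in inverse]

infos = [0.5, 1.0, 2.0]   # hypothetical values of I(T^(s)) for three subsets
r = 2                     # hypothetical maximum multiplicity of an index

w_star = optimal_weights(infos)
best = bound(w_star, infos, r)
closed_form = r / sum(1.0 / i for i in infos)   # reciprocal of the right side of (32)

uniform = [1.0 / len(infos)] * len(infos)
uniform_value = bound(uniform, infos, r)
```

At the optimal weights the bound equals $r / \sum_s I(T^{(s)})^{-1}$, so $I(T_n)$ not exceeding it is the same statement as (32); any other weighting, such as the uniform one, gives a looser bound.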

Remark 2: Theorem 2 for the special case $\mathcal{C} = \mathcal{C}_1$ of singleton sets is sometimes known as the "Stam inequality" and has a long history. Stam [39] was the first to prove Proposition 2 for $\mathcal{C}_1$, and he credited his doctoral advisor de Bruijn with noticing the equivalence to Theorem 2 for $\mathcal{C}_1$. Subsequently several different proofs have appeared: in Blachman [7] using Lemma 1, in Carlen [9] using another superadditivity property of the Fisher information, and in Kagan [22] as a consequence of an inequality for Pitman estimators. On the other hand, the special case of the leave-one-out sets $\mathcal{C} = \mathcal{C}_{n-1}$ in Theorem 2 was first proved in ABBN [1]. Zamir [46] used data processing properties of the Fisher information to prove some different extensions of the $\mathcal{C}_1$ case, including a multivariate version; see also Liu and Viswanath [27] for some related interpretations. Our result for arbitrary collections of subsets is new; yet our proof of this general result is essentially no harder than the elementary proofs of the original inequality by Stam and Blachman.



Remark 3: While inequality (29) in the proof above uses a Pythagorean inequality, one may use the associated Pythagorean identity to characterize the difference as the mean square of $\sum_s w_s \rho_s - \rho(T_n)$. In the i.i.d. case with $n = 2m$ and $\mathcal{C}$ a disjoint pair of subsets of size $m$, this drop in Fisher distance from the normal played an essential role in the previously mentioned CLT analyses of [37], [8], [4], [21]. Furthermore for general $\mathcal{C}$, we have from the variance drop analysis that the gap in inequality (30) is characterized by the non-additive ANOVA components of the score functions. We point out these observations as an encouragement to examination of the information drop for collections such as $\mathcal{C}_{n-1}$ and $\mathcal{C}_{n/2}$ in refined analysis of CLT rates under more general conditions.

V. Entropy Inequalities 

The Fisher information inequality of the previous section 
yields a corresponding entropy inequality. 
Proposition 3 (ENTROPY OF SUMS): Let $\{X_i\}$ be independent random variables with densities. Then, for any probability distribution $w$ on $\mathcal{C}$ such that $w_s \le \frac{1}{r}$ for each $s$,

$$ H\Big(\sum_{i \in [n]} X_i\Big) \ge \sum_{s \in \mathcal{C}} w_s H\Big(\sum_{i \in s} X_i\Big) + \frac{1}{2} H(w) - \frac{1}{2}\log r, \tag{33} $$

where $H(w) = \sum_{s \in \mathcal{C}} w_s \log\frac{1}{w_s}$ is the discrete entropy of $w$. When $\mathcal{C}$ is discriminating, equality can hold only if each $X_i$ is normal.

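Proof: As pointed out in the Introduction, Proposition 2 is equivalent to

$$ I\Big(\sum_{i \in [n]} Y_i\Big) \le \sum_{s \in \mathcal{C}} w_s \, I\left(\frac{\sum_{i \in s} Y_i}{\sqrt{w_s r}}\right) $$

for independent random variables $Y_i$. For application of the Fisher information inequality to entropy inequalities we also need for an independent standard normal $Z$ that

$$ I(T_n + \sqrt{t}\,Z) \le \sum_{s \in \mathcal{C}} w_s \, I\left(\frac{T^{(s)}}{\sqrt{w_s r}} + \sqrt{t}\,Z\right), \tag{34} $$

at least for suitable values of $w_s$. We will show below that this holds when $r w_s \le 1$ for each $s$, and thus

$$ H(T_n) = \frac{1}{2}\log(2\pi e) - \frac{1}{2}\int_0^\infty \left[I(T_n + \sqrt{t}\,Z) - \frac{1}{1+t}\right] dt $$
$$ \ge \frac{1}{2}\log(2\pi e) - \frac{1}{2}\sum_{s \in \mathcal{C}} w_s \int_0^\infty \left[I\left(\frac{T^{(s)}}{\sqrt{w_s r}} + \sqrt{t}\,Z\right) - \frac{1}{1+t}\right] dt = \sum_{s \in \mathcal{C}} w_s H\left(\frac{T^{(s)}}{\sqrt{w_s r}}\right), $$

using (34) for the inequality and Lemma 3 for the equalities. By the scaling property of entropy, this implies

$$ H(T_n) \ge \sum_{s \in \mathcal{C}} w_s H(T^{(s)}) - \frac{1}{2}\sum_{s \in \mathcal{C}} w_s \log w_s - \frac{1}{2}\log r, $$

proving the desired result.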

The inequality (34) is true though not immediately so (the naive approach of adding an independent normal to each $X_i$ does not work to get our desired inequalities when the subsets have more than one element). What we need is to provide a collection of independent normal random variables $Z_j$ for some set of indices $j$ (possibly many more than $n$ of them). For each $s$ in $\mathcal{C}$ we need an assignment of subset sums of $Z_j$ (called say $Z_{s'}$) which has variance $r w_s$, such that no $j$ is in more than $r$ of the subsets $s'$. Then by Proposition 2 (applied to the collection $\mathcal{C}'$ of augmented sets $s \cup s'$ for each $s$ in $\mathcal{C}$) we have

$$ I(T_n + \sqrt{t}\,Z) \le r \sum_{s \in \mathcal{C}} w_s^2 \, I(T^{(s)} + \sqrt{t}\,Z_{s'}), $$

from which the desired inequality follows using the fact that $a^2 I(X) = I(X/a)$. Assuming that $r w_s \le 1$ (which will be sufficient for our needs), we provide such a construction of $Z_j$ and their subset sums in the case of rational weights, say $w_s = W(s)/M$, where the denominator $M$ may be large. Indeed set $Z_1, \ldots, Z_M$ independent mean-zero normals each of variance $1/M$. For each $s$ we construct a set $s'$ that has precisely $rW(s)$ normals and each normal is assigned to precisely $r$ of these sets. This may be done systematically by considering the sets $s_1, s_2, \ldots$ in $\mathcal{C}$ in some order. We let $s_1'$ be the first $rW(s_1)$ indices $j$ (not more than $M$ by assumption), we let $s_2'$ be the next $rW(s_2)$ indices (looping back to the first index once we pass $M$) and so on. This proves the validity of (34) for rational weights; its validity for general weights follows by continuity.
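The cyclic assignment described in the proof can be sketched directly. In the code below the indices run from 0 to $M-1$ rather than 1 to $M$, and the numerators `W`, the denominator `M`, and the value of `r` are hypothetical inputs chosen so that $r w_s \le 1$; the function name is ours.

```python
def assign_normals(W, r, M):
    """Cyclically assign indices 0..M-1 so that the k-th set receives r * W[k]
    consecutive indices, wrapping around after index M - 1."""
    if sum(W) != M or any(r * w > M for w in W):
        raise ValueError("need sum(W) == M and r * W(s) <= M for every s")
    sets, start = [], 0
    for w in W:
        size = r * w
        sets.append([(start + j) % M for j in range(size)])
        start = (start + size) % M
    return sets

W = [2, 3, 1, 4]   # hypothetical numerators of the weights w_s = W(s) / M
M = 10             # common denominator, equal to the number of normals Z_j
r = 2              # hypothetical maximum multiplicity of the collection

sets = assign_normals(W, r, M)
counts = [sum(j in s for s in sets) for j in range(M)]   # how often each Z_j is used
```

Because the total number of assigned slots is $rM$ and they are laid out cyclically, every index is used exactly $r$ times, and each set holds $rW(s)$ distinct normals of variance $1/M$, giving $Z_{s'}$ the required variance $r w_s$.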



Remark 4: One may re-express the inequality of Proposition 3
as a statement for the relative entropies with respect to
normals of the same variance. If $X$ has density $f$, we write
$D(X) = E[\log \frac{f(X)}{g(X)}] = H(Z) - H(X)$, where $Z$ has
the Gaussian density $g$ with the same variance as $X$. Then,
for any probability distribution $w$ on a balanced collection $C$,

$$ D\Big(\sum_{i \in [n]} X_i\Big) \le \sum_{s \in C} w_s D\Big(\sum_{i \in s} X_i\Big) + \frac{1}{2} D(w \| \eta), \tag{35} $$

where $\eta$ is the probability distribution on $C$ given by
$\eta_s = \frac{\mathrm{Var}(T^{(s)})}{r\, \mathrm{Var}(T_n)}$,
and $D(w \| \eta)$ is the (discrete) relative entropy of $w$
with respect to $\eta$. When $C$ is also discriminating, equality
holds iff each $X_i$ is normal. Theorem 1 of Tulino and Verdú [42]
is the special case of inequality (35) for the collection of
leave-one-out sets. Inequality (35) can be further extended to the
case where $C$ is not balanced, but in that case $\eta$ is a
subprobability distribution. The conclusions (33) and (35) are
equivalent, and as seen in the next section, are equivalent to our
subset sum entropy power inequality.
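
For readers who want the passage from the entropy form to the relative
entropy form spelled out, here is a short sketch of the computation. It
uses the shorthand $v_n = \mathrm{Var}(T_n)$ and
$v_s = \mathrm{Var}(T^{(s)})$ (notation introduced only for this
derivation), the identity $D(X) = \frac12 \log(2\pi e\, \mathrm{Var}(X)) - H(X)$,
and the bound $H(T_n) \ge \sum_s w_s H(T^{(s)}) + \frac12 [H(w) - \log r]$.

```latex
\begin{align*}
D(T_n) &= \tfrac12 \log(2\pi e\, v_n) - H(T_n) \\
  &\le \tfrac12 \log(2\pi e\, v_n) - \sum_{s \in C} w_s H(T^{(s)})
       - \tfrac12 \bigl[ H(w) - \log r \bigr] \\
  &= \sum_{s \in C} w_s D(T^{(s)})
       + \tfrac12 \sum_{s \in C} w_s
         \bigl[ \log v_n - \log v_s + \log w_s + \log r \bigr] \\
  &= \sum_{s \in C} w_s D(T^{(s)})
       + \tfrac12 \sum_{s \in C} w_s \log \frac{w_s}{v_s / (r v_n)} \\
  &= \sum_{s \in C} w_s D(T^{(s)}) + \tfrac12 D(w \| \eta).
\end{align*}
```

Each step is reversible, which is the sense in which the two
conclusions are equivalent.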

Remark 5: It will become evident in the next section that
the condition $r w_s \le 1$ in Proposition 3 is not needed for the
validity of the conclusions (33) and (35) (see Remark 8).

VI. Entropy power inequalities
Proposition 3 is equivalent to a subset sum entropy power
inequality. Recall that the entropy power $e^{2H(X)}/(2\pi e)$ is
the variance of the normal with the same entropy as $X$. The term
entropy power is also used for the quantity

$$ \mathcal{N}(X) = e^{2H(X)}, \tag{36} $$

even when the constant factor of $2\pi e$ is excluded.



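Theorem 3: For independent random variables with finite
variances,

$$ \mathcal{N}(X_1 + \cdots + X_n) \ge \frac{1}{r} \sum_{s \in C} \mathcal{N}\Big(\sum_{i \in s} X_i\Big). \tag{37} $$

When $C$ is discriminating, the two sides are equal iff each $X_i$
is normal and $C$ is also balanced.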
Proof: Let $T^{(s)}$ be the subset sums as defined in (25).
Define $Z = \sum_{s \in C} \mathcal{N}(T^{(s)})$ as the
normalizing constant producing the weights

$$ \mu_s = \frac{\mathcal{N}(T^{(s)})}{Z}. \tag{38} $$

If $\mathcal{N}(T^{(s)}) \ge Z/r$ for some $s \in C$ then we
trivially have $\mathcal{N}(T_n) \ge \mathcal{N}(T^{(s)}) \ge Z/r$,
which is the desired result, so now assume
$\mathcal{N}(T^{(s)}) < Z/r$ for all $s \in C$, that is,
$r \mu_s < 1$ for each $s$.

Since

$$ H(T^{(s)}) = \frac{1}{2} \log \mathcal{N}(T^{(s)}) = \frac{1}{2} \log[\mu_s Z], $$

Proposition 3 implies for any weighting distribution $w$ with
$r w_s \le 1$ that

$$ H(T_n) \ge \frac{1}{2} \sum_{s \in C} w_s \log[\mu_s Z] + \frac{1}{2} \big[ H(w) - \log r \big] = \frac{1}{2} \Big[ \log \frac{Z}{r} - D(w \| \mu) \Big], \tag{39} $$

where $D(w \| \mu)$ is the (discrete) relative entropy.
Exponentiating gives

$$ \mathcal{N}(T_n) \ge r^{-1} e^{-D(w \| \mu)} Z. \tag{40} $$

It remains to optimize the right side over $w$, or equivalently
to minimize $D(w \| \mu)$ over feasible $w$. Since $r \mu_s < 1$
by assumption, $w = \mu$ is a feasible choice, yielding the
desired inequality.

The necessary conditions for equality follow from that for
Proposition 3, and it is easily checked using Fact 1 that this
is also sufficient. The proof is complete. ■
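
The two ingredients of this argument can be illustrated numerically in
the normal case, where the entropy power of a sum is $2\pi e$ times its
variance. The sketch below is an assumption-laden illustration (the
variances, the two collections and all function names are hypothetical):
it compares the two sides of the subset sum entropy power inequality for
a balanced and an unbalanced collection, and evaluates the bound
$r^{-1} e^{-D(w \| \mu)} Z$ at $w = \mu$ and at uniform weights.

```python
import math
from itertools import combinations

TWO_PI_E = 2 * math.pi * math.e

def entropy_power_normal(var):
    """exp(2 H(X)) for a normal X, which equals 2*pi*e*Var(X)."""
    return TWO_PI_E * var

def subset_powers(variances, collection):
    """Entropy powers of the subset sums of independent normals."""
    return [entropy_power_normal(sum(variances[i] for i in s))
            for s in collection]

def epi_sides(variances, collection):
    """Both sides of the subset sum entropy power inequality."""
    n = len(variances)
    r = max(sum(1 for s in collection if i in s) for i in range(n))
    lhs = entropy_power_normal(sum(variances))
    rhs = sum(subset_powers(variances, collection)) / r
    return lhs, rhs

def relative_entropy(w, mu):
    """Discrete relative entropy D(w || mu) in nats."""
    return sum(p * math.log(p / q) for p, q in zip(w, mu) if p > 0)

def weighted_bound(w, powers, r):
    """The lower bound exp(-D(w || mu)) * Z / r obtained from weights w."""
    Z = sum(powers)
    mu = [p / Z for p in powers]
    return math.exp(-relative_entropy(w, mu)) * Z / r

variances = [1.0, 2.0, 0.5]                 # hypothetical variances
balanced = list(combinations(range(3), 2))  # every index lies in r = 2 sets
unbalanced = [(0,), (0, 1), (1, 2)]         # index 2 lies in only one set

lhs_b, rhs_b = epi_sides(variances, balanced)
lhs_u, rhs_u = epi_sides(variances, unbalanced)

powers = subset_powers(variances, balanced)
Z = sum(powers)
mu = [p / Z for p in powers]
best = weighted_bound(mu, powers, 2)
other = weighted_bound([1 / 3] * 3, powers, 2)
```

Equality holds for normals on the balanced collection, the inequality is
strict on the unbalanced one, and the choice w = mu gives the largest
value of the bound, namely Z/r.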



Remark 6: To understand this result fully, it is useful to
note that if a discriminating collection $C$ is not balanced, it
can always be augmented to a new collection $C'$ that is balanced
in such a way that the inequality (37) for $C'$ becomes strictly
better than that for $C$. Indeed, if index $i$ appears in $r(i)$
sets of $C$, one can always find $r - r(i)$ sets of $C$ not
containing $i$ (since $r \le |C|$), and add $i$ to each of these
sets. The inequality (37) for $C'$ is strictly better since this
collection has the same $r$ and the subset sum entropy powers on
the right side are higher due to the addition of independent
random variables. While equality in (37) is impossible for the
unbalanced collection $C$, it holds for normals for the augmented,
balanced collection $C'$. This illuminates the conditions for
equality in Theorem 3.



Remark 7: The traditional Shannon inequality involving the
entropy powers of the summands [36] as well as the inequality
of ABBN [1] involving the entropy powers of the "leave-one-out"
normalized sums are two special cases of Theorem 3,
corresponding to $C = C_1$ and $C = C_{n-1}$. Proofs of the former
subsequent to Shannon's include those of Stam [39], Blachman
[7], Lieb [26] (using Young's inequality for convolutions with
the sharp constant), Dembo, Cover and Thomas [15] (building
on Costa and Cover [13]), and Verdú and Guo [43]. Note that
unlike the previous proofs of these special cases, our proof
of the equivalence between the linear form of Proposition 3
and Theorem 3 reduces to the non-negativity of the relative
entropy.

Remark 8: To see that (33) (and hence (35)) holds without
any assumption on $w$, simply note that when the assumption
is not satisfied, the entropy power inequality of Theorem 3
implies trivially that

$$ \mathcal{N}(T_n) \ge r^{-1} Z \ge r^{-1} e^{-D(w \| \mu)} Z $$

for $\mu$ defined by (38), and inverting the steps of (39)
yields (33).

VII. Entropy is Monotone on Average

In this section, we consider the behavior of the entropy of
sums of independent but not necessarily identically distributed
(i.n.i.d.) random variables under various scalings.

First we look at sums scaled according to the number of
summands. Fix the collection $C_m = \{ s \subset [n] : |s| = m \}$.
For i.n.i.d. random variables $X_i$, let

$$ Y_n = \frac{1}{\sqrt{n}} \sum_{i \in [n]} X_i \quad \text{and} \quad Y_m^{(s)} = \frac{1}{\sqrt{m}} \sum_{i \in s} X_i \tag{41} $$

for $s \in C_m$ be the scaled sums. Then Proposition 3 applied to
$C_m$ implies

$$ H(Y_n) \ge \sum_{s \in C_m} w_s H(Y_m^{(s)}) - \frac{1}{2} \Big[ \log \binom{n}{m} - H(w) \Big]. $$

The term on the right indicates that we pay a cost for
deviations of the weighting distribution $w$ from the uniform.
In particular, choosing $w$ to be uniform implies that entropy is
"monotone on average" with uniform weights for scaled sums
of i.n.i.d. random variables. Applying Theorem 3 to $C_m$ yields
a similar conclusion for entropy power. These observations,
which can also be deduced from the results of ABBN [1], are
collected in Corollary 1.

Corollary 1: Suppose $X_i$ are independent random variables
with densities, and the scaled sums are defined by (41). Then

$$ H(Y_n) \ge \frac{1}{\binom{n}{m}} \sum_{s \in C_m} H(Y_m^{(s)}) \quad \text{and} \quad \mathcal{N}(Y_n) \ge \frac{1}{\binom{n}{m}} \sum_{s \in C_m} \mathcal{N}(Y_m^{(s)}). \tag{42} $$


Remark 9: It is interesting to contrast Corollary 1 with the
following results of Han [17] and Dembo, Cover and Thomas
[15]. With no assumptions on $(X_1, \ldots, X_n)$ except that they
have a joint density, the above authors show that

$$ \frac{H(X_{[n]})}{n} \le \frac{1}{\binom{n}{m}} \sum_{s \in C_m} \frac{H(X_s)}{m} \tag{43} $$

and

$$ \big[ \mathcal{N}(X_{[n]}) \big]^{1/n} \le \frac{1}{\binom{n}{m}} \sum_{s \in C_m} \big[ \mathcal{N}(X_s) \big]^{1/m}, \tag{44} $$

where $H(X_s)$ and $\mathcal{N}(X_s)$ denote the joint entropy and
joint entropy power of $X_s$ respectively. These bounds have a form
very similar to that of Corollary 1. In fact, such an analogy
between inequalities for entropy of sums and joint entropies
goes much deeper (so that all of the entropy power inequalities
we present here have analogues for joint entropy). More details
can be found in Madiman and Tetali [30].

Next, we consider sums of independent random variables
standardized by their variances. This is motivated by the
following consideration. Consider a sequence of i.n.i.d. random
variables $\{X_i : i \in \mathbb{N}\}$ with zero mean and finite
variances, $\sigma_i^2 = \mathrm{Var}(X_i)$. The variance of the
sum of $n$ variables is denoted $v_n = \sum_{i \in [n]} \sigma_i^2$,
and the standardized sum is

$$ V_n = \frac{\sum_{i \in [n]} X_i}{\sqrt{v_n}}. $$

The Lindeberg-Feller central limit theorem gives conditions
under which $V_n \Rightarrow N(0, 1)$. Johnson [20] has proved an
entropic version of this classical theorem, showing (under
appropriate conditions) that $H(V_n) \to \frac{1}{2} \log(2\pi e)$
and hence the relative entropy from the unit normal tends to 0. Is
there an analogue of the monotonicity of information in this
setting?

We address this question in the following theorem, and 
give two proofs. The first proof is based on considering 
appropriately standardized linear combinations of independent 
random variables, and generalizes Theorem 2 of ABBN [1]. 
The second proof is outlined in Remark 11.



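Theorem 4 (Monotonicity on Average): Suppose
$\{X_i : i \in [n]\}$ are independent random variables with
densities, and $X_i$ has finite variance $\sigma_i^2$. Set
$v_n = \sum_{i \in [n]} \sigma_i^2$ and
$v_s = \sum_{i \in s} \sigma_i^2$ for sets $s$ in the balanced
collection $C$. Define the standardized sums

$$ V_n = \frac{\sum_{i \in [n]} X_i}{\sqrt{v_n}} \tag{45} $$

and

$$ V^{(s)} = \frac{\sum_{i \in s} X_i}{\sqrt{v_s}}. \tag{46} $$

Then

$$ H(V_n) \ge \sum_{s \in C} \eta_s H(V^{(s)}), \tag{47} $$

where $\eta_s = \frac{v_s}{r v_n}$. Furthermore, if $C$ is also
discriminating, then the inequality is strict unless each $X_i$ is
normal.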

Proof: Let $a_i$, $i \in [n]$, be a collection of non-negative real
numbers such that $\sum_{i=1}^{n} a_i^2 = 1$. Define
$\bar{a}_s = \big( \sum_{i \in s} a_i^2 \big)^{1/2}$
and the weights $\lambda_s = \bar{a}_s^2 / r$ for $s \in C$.
Applying the inequalities of Theorem 2, Proposition 3 and
Theorem 3 to the independent random variables $a_i X_i'$, and
utilizing the scaling properties of the relevant information
quantities, one finds that

$$ \psi\Big( \sum_{i=1}^{n} a_i X_i' \Big) \ge \sum_{s \in C} \lambda_s \, \psi\Big( \frac{\sum_{i \in s} a_i X_i'}{\bar{a}_s} \Big), \tag{48} $$

where $\psi$ represents either the inverse Fisher information
$I^{-1}$ or the entropy $H$ or the entropy power $\mathcal{N}$.

The conclusion of Theorem 4 is a particular instance of
(48). Indeed, we can express the random variables of interest as
$X_i = \sigma_i X_i'$, so that each $X_i'$ has variance 1. Choose
$a_i = \sigma_i / \sqrt{v_n}$, which is valid since
$\sum_{i \in [n]} a_i^2 = 1$. Then $\bar{a}_s^2 = v_s / v_n$
and $\lambda_s = \eta_s$. Thus

$$ V_n = \frac{\sum_{i \in [n]} X_i}{\sqrt{v_n}} = \sum_{i \in [n]} a_i X_i' $$

and

$$ V^{(s)} = \frac{\sum_{i \in s} X_i}{\sqrt{v_s}} = \frac{1}{\bar{a}_s} \sum_{i \in s} a_i X_i'. $$

Now an application of (48) gives the desired result, not just
for $H$ but also for $\mathcal{N}$ and $I^{-1}$. ■



Remark 10: Since the collection $C$ is balanced, it follows
from Fact 1 that $\eta_s$ defines a probability distribution on
$C$. This justifies the interpretation of Theorem 4 as displaying
"monotonicity on average". The averaging distribution $\eta$ is
tuned to the random variables of interest, through their variances.
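
As a small illustration of how the averaging weights arise, the sketch
below computes them for a balanced collection and rejects unbalanced
ones. The function names, the variances and the example collection are
hypothetical, chosen only to make the computation concrete.

```python
def coverage_counts(collection, n):
    """Number of sets of the collection that contain each index."""
    return [sum(1 for s in collection if i in s) for i in range(n)]

def eta_weights(variances, collection):
    """Weights eta_s = v_s / (r * v_n) for a balanced collection."""
    n = len(variances)
    counts = coverage_counts(collection, n)
    r = max(counts)
    if any(c != r for c in counts):
        raise ValueError("collection is not balanced")
    v_n = sum(variances)
    return [sum(variances[i] for i in s) / (r * v_n) for s in collection]

variances = [1.0, 4.0, 9.0]             # hypothetical variances
pairs = [(0, 1), (1, 2), (2, 0)]        # balanced: each index in r = 2 sets
eta = eta_weights(variances, pairs)
```

Because each variance is counted exactly r times when the subset-sum
variances are added up, the weights sum to one; subsets with larger
variance receive proportionally more weight.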



Remark 11: Theorem 4 also follows directly from (35)
upon setting $w_s = \eta_s$ and noting that the definition of
$D(X)$ is scale invariant (i.e., $D(aX) = D(X)$ for any nonzero
real number $a$).

Let us briefly comment on the interpretation of this result. 
As discussed before, when the summands are i.i.d., entropic 
convergence of V n to the normal was shown in [4], and ABBN 
[1] showed that this sequence of entropies is monotonically in- 
creasing. This completes a long-conjectured intuitive picture of 
the central limit theorem: forming normalized sums that keep 
the variance constant yields random variables with increasing 
entropy, and this sequence of entropies converges to the 
maximum entropy possible, which is the entropy of the normal 
with that variance. In this sense, the CLT is a formulation of 
the "second law of thermodynamics" in physics. Theorem 4
above shows that even in the setting of variance-standardized 
sums of i.n.i.d. random variables, a general monotonicity on 
average property holds with respect to an arbitrary collection 
of normalized subset sums. This strengthens the "second law" 
interpretation of central limit theorems. 
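
The i.i.d. picture described above can be seen in a case where the
entropies are explicit. The sketch below assumes i.i.d. exponential
summands with unit mean (an illustrative choice not made in the text),
for which the sum of n summands is Gamma distributed with entropy
$n + \log \Gamma(n) + (1 - n) \psi(n)$, where $\psi$ here denotes the
digamma function. The function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329

def digamma_int(k):
    """Digamma function at a positive integer k: -gamma + sum_{j<k} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, k))

def gamma_entropy(k):
    """Differential entropy (nats) of a Gamma(shape=k, scale=1) variable."""
    return k + math.lgamma(k) + (1 - k) * digamma_int(k)

def standardized_sum_entropy(n):
    """Entropy of (X_1 + ... + X_n - n) / sqrt(n) for i.i.d. Exp(1) X_i."""
    # Centering does not change entropy; scaling by 1/sqrt(n)
    # subtracts (1/2) log n.
    return gamma_entropy(n) - 0.5 * math.log(n)

entropies = [standardized_sum_entropy(n) for n in range(1, 31)]
limit = 0.5 * math.log(2 * math.pi * math.e)  # entropy of the unit normal
```

The resulting sequence increases with n and stays below the entropy of
the unit normal, in line with the monotonicity described in the text.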

A similar monotonicity on average property also holds 
for appropriate notions of Fisher information in convergence 
of sums of discrete random variables to the Poisson and 
compound Poisson distributions; details may be found in [29]. 

VIII. A Refined Inequality 

Various extensions of the basic inequalities presented above 
are possible; we present one here. To state it, we find it 
convenient to recall the notion of a fractional packing from 
discrete mathematics (see, e.g., Chung, Füredi, Garey and
Graham [11]). 
Definition 1: Let $C$ be a collection of subsets of $[n]$. A
collection $\{\beta_s : s \in C\}$ of non-negative real numbers is
called a fractional packing for $C$ if

$$ \sum_{s \ni i,\, s \in C} \beta_s \le 1 \tag{49} $$

for each $i$ in $[n]$.



Note that if the numbers $\beta_s$ are constrained to only take the
values 0 and 1, then the condition above entails that not more
than one set in $C$ can contain $i$, i.e., that the sets $s \in C$
are pairwise disjoint, and provide a packing of the set $[n]$. We
may interpret a fractional packing as a "packing" of $[n]$ using
sets in $C$, each of which contains only a fractional piece (namely,
$\beta_s$) of the elements in that set.

We now present a refined version of Lemma 2.



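Lemma 4 (Variance Drop: General Version):
Suppose $U(X_1, \ldots, X_n) = \sum_{s \in C} \psi_s(X_s)$ is a
$C$-additive function with mean zero components, as in Lemma 2.
Then

$$ E U^2 \le \sum_{s \in C} \frac{1}{\beta_s} E \big( \psi_s(X_s) \big)^2 \tag{50} $$

for any fractional packing $\{\beta_s : s \in C\}$.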

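Proof: As in the proof of Lemma 2, we have

$$ E U^2 = \sum_{t} E \Big[ \sum_{s \supset t,\, s \in C} \bar{E}_t \psi_s(X_s) \Big]^2. $$

We now proceed to perform a different estimation of this
expression, recalling as in the proof of Lemma 2 that the
outer summation over $t$ can be restricted to non-empty sets $t$.
By the Cauchy-Schwarz inequality,

$$ \Big[ \sum_{s \supset t,\, s \in C} \bar{E}_t \psi_s(X_s) \Big]^2 \le \Big[ \sum_{s \supset t,\, s \in C} \big( \sqrt{\beta_s} \big)^2 \Big] \Big[ \sum_{s \supset t,\, s \in C} \Big( \frac{1}{\sqrt{\beta_s}} \bar{E}_t \psi_s(X_s) \Big)^2 \Big]. $$

Since any $t$ of interest has at least one element, the definition
of a fractional packing implies that

$$ \sum_{s \supset t,\, s \in C} \beta_s \le 1. $$

Thus

$$ E U^2 \le \sum_{t} \sum_{s \supset t,\, s \in C} \frac{1}{\beta_s} E \big( \bar{E}_t \psi_s(X_s) \big)^2 = \sum_{s \in C} \frac{1}{\beta_s} \sum_{t \subset s} E \big( \bar{E}_t \psi_s(X_s) \big)^2 = \sum_{s \in C} \frac{1}{\beta_s} E \psi_s^2(X_s). \tag{51} $$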
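
The refined variance drop bound can be checked exactly on a small
example. The sketch below is an illustration under assumptions that are
not part of the text: the summands are independent fair signs, the
components have the hypothetical form "sum plus half the product" of
the coordinates in each set, and the collection and packing are chosen
by hand.

```python
from itertools import product
from math import prod

n = 4
collection = [(0, 1), (1, 2), (3,)]       # hypothetical collection
beta = [0.5, 0.5, 1.0]                    # a fractional packing for it
points = list(product((-1, 1), repeat=n)) # independent fair signs

def psi(s, x):
    """A mean-zero component depending only on the coordinates in s."""
    return sum(x[i] for i in s) + 0.5 * prod(x[i] for i in s)

def expect(f):
    """Exact expectation over the uniform distribution on {-1, +1}^n."""
    return sum(f(x) for x in points) / len(points)

def is_fractional_packing(beta, collection, n):
    """Check that the weights of the sets containing each index sum to <= 1."""
    return all(
        sum(b for b, s in zip(beta, collection) if i in s) <= 1 + 1e-12
        for i in range(n)
    )

second_moments = [expect(lambda x, s=s: psi(s, x) ** 2) for s in collection]

# Left side: second moment of the C-additive function U
lhs = expect(lambda x: sum(psi(s, x) for s in collection) ** 2)
# Refined bound with set-dependent coefficients 1 / beta_s
refined = sum(m / b for m, b in zip(second_moments, beta))
# Bound with the constant coefficient r = 2 (index 1 lies in two sets)
crude = 2 * sum(second_moments)
```

In this example the bound with coefficients $1/\beta_s$ is tighter than
the one with the constant factor r, because the singleton set does not
overlap any other set and can be given the full weight 1.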



Exactly as before, one can obtain inequalities for Fisher
information, entropy and entropy power based on this form of
the variance drop lemma. The idea of looking for such
refinements with coefficients depending on $s$ arose in
conversations with Tom Cover and Prasad Tetali at ISIT 2006 in
Seattle, after our basic results described in the previous
sections were presented. In particular, Prasad's joint work with
one of us [30] influenced the development of Theorem 5.

Theorem 5: Let $\{\beta_s : s \in C\}$ be any fractional packing
for $C$. Then

$$ I^{-1}(X_1 + \cdots + X_n) \ge \sum_{s \in C} \beta_s \, I^{-1}\Big( \sum_{j \in s} X_j \Big). \tag{52} $$



For given subset sum informations, the best such lower
bound on the information of the total sum would involve
maximizing the right side of (52) subject to the linear
constraints (49). This linear programming problem, a version
of which is the problem of optimal fractional packing
well-studied in combinatorics (see, e.g., [11]), does not have an
explicit solution in general.

A natural choice of a fractional packing in Theorem 5 leads
to the following corollary.

Corollary 2: For any collection $C$ of subsets of $[n]$, let $r(i)$
denote the number of sets in $C$ that contain $i$. In the same
setting as Theorem 2, we have

$$ \frac{1}{I(X_1 + \cdots + X_n)} \ge \sum_{s \in C} \frac{1}{r(s) \, I\big( \sum_{i \in s} X_i \big)}, \tag{53} $$

where $r(s)$ is the maximum value of $r(i)$ over the indices $i$
in $s$.

We say that $C$ is quasibalanced if $r(i) = r(s)$ for each
$i \in s$ and each $s$. If $C$ is discriminating, equality holds in
(53) if the $X_i$ are normal and $C$ is quasibalanced.
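
The natural packing behind this corollary, with weight 1/r(s) on each
set, is easy to compute and to test. The sketch below does so and
evaluates both sides of the inequality in the normal case, where the
Fisher information of a normal is the reciprocal of its variance. The
collections, the variances and all function names are hypothetical.

```python
def coverage(collection, n):
    """r(i): number of sets of the collection containing index i."""
    return [sum(1 for s in collection if i in s) for i in range(n)]

def natural_packing(collection, n):
    """beta_s = 1 / r(s), where r(s) is the largest r(i) over i in s."""
    r_i = coverage(collection, n)
    return [1.0 / max(r_i[i] for i in s) for s in collection]

def is_fractional_packing(beta, collection, n, tol=1e-12):
    """Non-negative weights whose sum over sets containing i is at most 1."""
    return all(b >= 0 for b in beta) and all(
        sum(b for b, s in zip(beta, collection) if i in s) <= 1 + tol
        for i in range(n)
    )

def is_quasibalanced(collection, n):
    """True if r(i) takes a single value within each set."""
    r_i = coverage(collection, n)
    return all(len({r_i[i] for i in s}) == 1 for s in collection)

def normal_case_sides(variances, collection):
    """For normals 1/I equals the variance: returns (v_n, sum_s v_s / r(s))."""
    n = len(variances)
    beta = natural_packing(collection, n)
    rhs = sum(b * sum(variances[i] for i in s)
              for b, s in zip(beta, collection))
    return sum(variances), rhs

variances = [1.0, 2.0, 0.5, 3.0]             # hypothetical variances
quasi = [(0,), (1, 2), (1, 3), (2, 3)]       # quasibalanced, not balanced
other = [(0, 1), (1, 2), (3,)]               # not quasibalanced

lhs_q, rhs_q = normal_case_sides(variances, quasi)
lhs_o, rhs_o = normal_case_sides(variances, other)
```

For normals the two sides agree on the quasibalanced collection even
though it is not balanced, while the inequality is strict on the
collection that is not quasibalanced.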



Remark 12: For any collection $C$ and any $s \in C$, $r(s) \le r$
by definition. Thus Theorem 5 and Corollary 2 generalize
Theorem 2. Furthermore, from the equality conditions in
Corollary 2, we see that equality can hold in these more
general inequalities even for collections $C$ that are not
balanced, which was not possible with the original formulation
in Theorem 2.

Remark 13: One can also give an alternate proof of Theorem 5
using Corollary 2 (which could be proved directly), so
that the two results are mathematically equivalent. The key to
doing this is the observation that nowhere in our proofs do we
actually require that the sets $s$ in $C$ are distinct. In other
words, given a collection $C$, one may look at an augmented
collection that has $k_s$ copies of each set $s$ in $C$. Then the
inequality (53) holds for the augmented collection with the counts
$r(i)$ and $r(s)$ appropriately modified. By considering arbitrary
augmentations, one can obtain Theorem 5 for fractional packings
with rational coefficients. An approximation argument yields
the full version. This method of proof, although picturesque,
is somewhat less transparent in the details.
Remark 14: It is straightforward to extend Theorem 5 and
Corollary 2 to the multivariate case, where the $X_i$ are
independent $\mathbb{R}^d$-valued random vectors, and $I(X)$
represents the trace of the Fisher information matrix of $X$.
Similarly, extending Theorem 3, one obtains for independent
$\mathbb{R}^d$-valued random vectors $X_1, \ldots, X_n$ with
densities and finite covariance matrices that

$$ e^{\frac{2 H(X_1 + \cdots + X_n)}{d}} \ge \frac{1}{r} \sum_{s \in C} e^{\frac{2 H( \sum_{j \in s} X_j )}{d}}, $$

which implies the monotonicity of entropy for standardized
sums of $d$-dimensional random vectors. We leave the details
to the reader.

Remark 15: It is natural to speculate whether an analogous
subset sum entropy power inequality holds with $1/r(s)$ inside
the sum. For each $r$ between the minimum and the maximum
of the $r(s)$, we can split $C$ into the sets
$C_r = \{ s \in C : r(s) = r \}$. Under an assumption that no one
set $s$ in $C_r$ dominates, that is, that there is no
$s^* \in C_r$ with
$\mathcal{N}(T^{(s^*)}) > \sum_{s \in C_r} \mathcal{N}(T^{(s)}) / r$,
we are able to create suitable normals for perturbation of
the Fisher information inequality and integrate (in the same
manner as in the proof of Proposition 3) to obtain

$$ \mathcal{N}(T_n) \ge \sum_{s \in C} \frac{\mathcal{N}(T^{(s)})}{r(s)}. \tag{54} $$

The quasibalanced case (in which $r(i)$ is the same for each $i$
in $s$) is an interesting special case. Then the unions of sets in
$C_r$ are disjoint for distinct values of $r$. So for quasibalanced
collections the refined subset sum entropy power inequality
(54) always holds by combining our observation above with
the Shannon-Stam entropy power inequality.

IX. Concluding Remarks 

Our main contributions in this paper are rather general C- 
superadditivity inequalities for Fisher information and entropy 
power that hold for arbitrary collections C, and specialize 
to both the Shannon-Stam inequalities and the inequalities 
of ABBN [1]. In particular, we prove all these inequalities 
transparently using only simple projection facts, a variance 
drop lemma and classical information-theoretic ideas. A re- 
markable feature of our proofs is that their main ingredients 
are rather well-known, although our generalizations of the 
variance drop lemma appear to be new and are perhaps of 
independent interest. Both our results as well as the proofs 
lend themselves to intuitive statistical interpretations, several 
of which we have pointed out in the paper. We now point to 
potential directions of application. 

The inequalities of this paper are relevant to the study of 
central limit theorems, especially for i.n.i.d. random variables. 
Indeed, we demonstrated monotonicity on average properties 
in such settings. Moreover, most approaches to entropic central 
limit theorems involve a detailed quantification of the gaps 
associated with monotonicity properties of Fisher informa- 
tion when the summands are non-normal. Since the gap in 
our inequality is especially accessible due to our use of a 
Pythagorean property of projections (see Remark 3), it could
be of interest in obtaining transparent proofs of entropic central 



limit theorems in i.n.i.d. settings, and perhaps rate conclusions 
under less restrictive assumptions than those imposed in [21] 
and [2]. 

The new Fisher information inequalities we present are also 
of interest, because of the relationship of inverse Fisher infor- 
mation to asymptotically efficient estimation. In this context, 
the subset sum inequality can be interpreted as a comparison 
of an asymptotic mean squared error achieved with use of 
all $X_1, \ldots, X_n$, and the sum of the mean squared errors
achieved in distributed estimation by sensors that observe
$(X_i, i \in s)$ for $s \in C$. The parameter of interest can either be a
location parameter, or (following [21]) a natural parameter of 
exponential families for which the minimal sufficient statistics 
are sums. Furthermore, a non-asymptotic generalization of the 
new Fisher information inequalities holds (see [5] for details), 
which sheds light on minimax risks for estimation of a location 
parameter from sums. 

Entropy inequalities involving subsets of random variables 
(although traditionally not involving sums) have played an im- 
portant role in understanding some problems of graph theory. 
Radhakrishnan [33] provides a nice survey, and some recent 
developments (including joint entropy inequalities analogous 
to the entropy power inequalities in this paper) are discussed 
in [30]. The appearance of fractional packings in the refined 
inequality we present in Section VIII is particularly suggestive
of further connections to be explored. 

In multi-user information theory, subset sums of rates and 
information quantities involving subsets of random variables 
are critical in characterizing rate regions of certain source 
and channel coding problems (e.g., m-user multiple access 
channels). Furthermore, there is a long history of the use of the 
classical entropy power inequality in the study of rate regions, 
see, e.g., Shannon [36], Bergmans [6], Ozarow [32], Costa [12] 
and Oohama [31]. For instance, the classical entropy power 
inequality was a key tool in Ozarow's solution of the Gaussian 
multiple description problem for two multiple descriptions, but 
seems to have been inadequate for problems involving three or 
more descriptions (see Wang and Viswanath [45] for a recent 
solution of one such problem without using the entropy power 
inequality). It seems natural to expand the set of tools available 
for investigation in these contexts. 

Appendix I 
The Analysis of Variance Decomposition 

In order to prove the variance drop lemma, we use a
decomposition of functions in $L^2(\mathbb{R}^n)$, which is nothing
but the Analysis of Variance (ANOVA) decomposition of a statistic.
For any $j \in [n]$, $E_j \psi$ denotes the conditional expectation
of $\psi$, given all random variables other than $X_j$, i.e.,

$$ E_j \psi(x_1, \ldots, x_n) = E[ \psi(X_1, \ldots, X_n) \mid X_i = x_i \ \forall i \ne j ] \tag{55} $$

averages out the dependence on the $j$-th coordinate.

Fact 2 (ANOVA Decomposition): Suppose
$\psi : \mathbb{R}^n \to \mathbb{R}$ satisfies
$E \psi^2(X_1, \ldots, X_n) < \infty$, i.e., $\psi \in L^2$, for
independent random variables $X_1, X_2, \ldots, X_n$. For
$t \subset [n]$, define the orthogonal linear subspaces

$$ \mathcal{H}_t = \{ \psi \in L^2 : E_j \psi = \psi \, 1_{\{j \notin t\}} \ \forall j \in [n] \} \tag{56} $$

of functions depending only on the variables indexed by
$t$. Then $L^2$ is the orthogonal direct sum of this family of
subspaces, i.e., any $\psi \in L^2$ can be written in the form

$$ \psi = \sum_{t \subset [n]} \bar{E}_t \psi, \tag{57} $$

where $\bar{E}_t \psi \in \mathcal{H}_t$, and the subspaces
$\mathcal{H}_t$ (for $t \subset [n]$) are orthogonal to each other.

Proof: Let $E_t$ denote the integrating out of the variables
in $t$, so that $E_j = E_{\{j\}}$. Keeping in mind that the order of
integrating out independent variables does not matter (i.e., the
$E_j$ are commuting projection operators in $L^2$), we can write

$$ \psi = \prod_{j=1}^{n} \big[ E_j + (I - E_j) \big] \psi = \sum_{t \subset [n]} \prod_{j \notin t} E_j \prod_{j \in t} (I - E_j) \psi = \sum_{t \subset [n]} \bar{E}_t \psi, \tag{58} $$

where

$$ \bar{E}_t \psi \equiv E_{t^c} \prod_{j \in t} (I - E_j) \psi. \tag{59} $$

Note that if $j$ is in $t$, $E_j \bar{E}_t \psi = 0$, being in the
image of the operator $E_j (I - E_j) = 0$. If $j$ is not in $t$,
$\bar{E}_t \psi$ is already in the image of $E_j$, and a further
application of the projection $E_j$ has no effect. Thus
$\bar{E}_t \psi$ is in $\mathcal{H}_t$.

Finally we wish to show that the subspaces $\mathcal{H}_t$ are
orthogonal. For any distinct sets $t_1$ and $t_2$ in $[n]$, there
exists an index $j$ which is in one (say $t_1$), but not the other
(say $t_2$). Then, by definition, $\bar{E}_{t_2} \psi$ is contained
in the image of $E_j$ and $\bar{E}_{t_1} \psi$ is contained in the
image of $(I - E_j)$. Hence $\bar{E}_{t_1} \psi$ is
orthogonal to $\bar{E}_{t_2} \psi$. ■
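
For independent variables taking finitely many values, the
decomposition can be computed exactly by applying the averaging
operators one coordinate at a time. The sketch below is an illustration
only: the marginal distributions, the test function and the names
`E`, `component` and `inner` are hypothetical.

```python
import math
from itertools import combinations, product

pmfs = [[0.5, 0.5], [0.2, 0.3, 0.5], [0.7, 0.3]]  # hypothetical marginals
n = len(pmfs)
points = list(product(*[range(len(p)) for p in pmfs]))
# Product measure of the independent coordinates
weight = {x: math.prod(pmfs[j][x[j]] for j in range(n)) for x in points}

def E(j, f):
    """Average out coordinate j, leaving the other coordinates fixed."""
    return {
        x: sum(pmfs[j][v] * f[x[:j] + (v,) + x[j + 1:]]
               for v in range(len(pmfs[j])))
        for x in points
    }

def component(t, f):
    """Apply (I - E_j) for j in t and E_j for j not in t (they commute)."""
    g = dict(f)
    for j in range(n):
        Ejg = E(j, g)
        g = {x: g[x] - Ejg[x] for x in points} if j in t else Ejg
    return g

def inner(f, g):
    """Inner product E[f(X) g(X)] under the product measure."""
    return sum(weight[x] * f[x] * g[x] for x in points)

# An arbitrary function of the three coordinates
psi = {x: x[0] + x[1] * x[2] + math.sin(x[0] + 2 * x[1] - x[2])
       for x in points}
subsets = [t for k in range(n + 1) for t in combinations(range(n), k)]
parts = {t: component(t, psi) for t in subsets}

# Largest pointwise error when the components are added back together
reconstruction_error = max(
    abs(sum(parts[t][x] for t in subsets) - psi[x]) for x in points
)
# Largest inner product between components of distinct subsets
max_cross_term = max(
    abs(inner(parts[t1], parts[t2]))
    for t1 in subsets for t2 in subsets if t1 != t2
)
```

The components add back up to the original function and are mutually
orthogonal, so their second moments add up to the second moment of the
function, which is the property used in the variance drop lemma.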

Remark 16: In the language of ANOVA familiar to statisticians,
when $\phi$ is the empty set, $\bar{E}_{\phi} \psi$ is the mean;
$\bar{E}_{\{1\}} \psi, \bar{E}_{\{2\}} \psi, \ldots, \bar{E}_{\{n\}} \psi$
are the main effects; $\{ \bar{E}_t \psi : |t| = 2 \}$ are the
pairwise interactions, and so on. Fact 2 implies that for any
subset $s \subset [n]$, the function
$\sum_{\{t : t \subset s\}} \bar{E}_t \psi$ is the best
approximation (in mean square) to $\psi$ that depends only on the
collection $X_s$ of random variables.

Remark 17: The historical roots of this decomposition lie in 
the work of von Mises [44] and Hoeffding [18]. For various 
interpretations, see Kurkjian and Zelen [24], Jacobsen [19], 
Rubin and Vitale [34], Efron and Stein [16], Karlin and Rinott 
[23], and Steele [40]; these works include applications of 
such decompositions to experimental design, linear models, 
U-statistics, and jackknife theory. Takemura [41] describes a
general unifying framework for ANOVA decompositions. 